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Abstract - Community structure in networks has been investigated from many view- 
points, usually with the same end result: a community detection algorithm of some 
kind. Recent research offers methods for combining the results of such algorithms 
into timelines of community evolution. This paper investigates community detection 
and tracking from the data fusion perspective. We avoid the kind of hard calls made 
by traditional community detection algorithms in favor of retaining as much uncer- 
tainty information as possible. This results in a method for directly estimating the 
probabilities that pairs of nodes are in the same community. We demonstrate that 
this method is accurate using the LFR testbed, that it is fast on a number of standard 
network datasets, and that it is has a variety of uses that complement those of stan- 
dard, hard-call methods. Retaining uncertainty information allows us to develop a 
Bayesian filter for tracking communities. We derive equations for the full filter, and 
marginalize it to produce a potentially practical version. Finally, we discuss closures 
for the marginalized filter and the work that remains to develop this into a principled, 
efficient method for tracking time- evolving communities on time-evolving networks. 

Keywords: Community detection, community tracking, Bayesian filter, co-membership proba- 
bility, dynamic stochastic blockmodel. 

1 Introduction 

The science of networks has a large and multidisciplinary literature. Freeman traces the 
sociological literature on networks from its pre-cursors in the 1800s and earlier, through 
the sociometry of the 1930s and Milgram's "Small Worlds" experiments in the 1960s, to 
its current form [29]. Sociologists and statisticians introduced the idea of defining network 
metrics: simple computations that one can perform on a network, accompanied by argu- 
ments that explain their significance: e.g., the clustering coefficient and various measures 
of network centrality [77]. What Lewis calls the "modern period" of network science [50] 
began in 1998 with the influx of physicists into the field (e.g., Barabasi and Newman). 
The physicists brought novel interests and techniques (power laws, Hamiltonians, mean field 



approximation, etc.), particularly from statistical physics, along with an overarching drive 
toward universality — properties of network structure independent of the particular nature 
of the nodes and links involved [HS]- Mathematicians have their own traditions of graph 
theory [5J, and, in particular, random graph theory (33 [TO] which emphasizes rigorously for- 
mulated models and what properties the graphs they produce have (with high probability) 
in different asymptotic regions of the models' parameter spaces. Finally, computer scientists 
have developed a wide variety of efficient network algorithms [15J , and continue to contribute 
broadly because ultimately the processing of data into usable results is always accomplished 
via an algorithm of some kind, and because solid computer science is needed for processing 
megascale, real- world networks. 

Each of the above communities brings important, complementary talents to network 
science. The data fusion community has important perspectives to offer too, due both to the 
broad range of practical issues that it addresses, and to characteristics of the mathematical 
techniques it employs [2J. 

The defining problem of data fusion is to process data into useful knowledge. These 
data may be of radically different types. One might consider a single data point to be, e.g., 
the position estimate of a target, a database record containing entries for various fields, an 
RDF triple in some specified ontology, or a document such as an image, sound, or text file. 
Data mining deals with similar issues, but focuses on the patterns in and transformations 
of large data sets, whereas data fusion focuses on distilling effective situational awareness 
about the real world. A central paradigm in data fusion is the hierarchy of fusion levels 
needed to transform raw data into high-level knowledge, the most widespread paradigm 
being the JDL (Joint Directors of Laboratories) hierarchy [69]. In this paradigm, level 
comprises all the pre-processing that must occur before one can even refer to "objects." In 
some cases it is clear that a data point corresponds to some single object, but it is unclear 
which object: this is an entity resolution problem [6]. In other data point contains 

information about multiple objects, and determining which information corresponds to which 
object is a data association problem [531 125]. Processing speech or images requires solving a 
segmentation problem to map data to objects [30J , and natural language processing involves 
a further array of specific techniques (Named Entity Recognition, etc.). One benefit of a 
data-fusion approach to network science is its careful consideration, at level 0, of how to 
abstract representations from raw data. In the network context, this applies not just to 
nodes (i.e., objects), but to the links between them. Sometimes one imposes an arbitrary 
threshold for link formation; sometimes multi-node relationships (i.e., hypergraph edges) are 
replaced with edges between all nodes involved. Edges can have natural direction, weights, or 
types (and nodes may have attributes) that are retained or ignored in a graph representation. 
When data are inappropriately shoehorned into a network format, or important node or link 
attributes are ignored, then the results derived from that graph representation may be much 
less powerful than they could be, or even completely misleading. 

Level-0 data fusion encompasses these pre-processing techniques, drawn from computer 
science, data mining, and the domains specific to the data being considered. Higher-level 
fusion (say, JDL levels 3-5) addresses another set of issues important to a complete theory of 
network science. These issues relate to human knowledge and intent. Just as level-0 fusion 
has similarities with the computer-science approach to networks, higher-level fusion has 
some overlap with the sociological approach. Levels 1 and 2, on the other hand, correspond 
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loosely to the more theoretical approaches of mathematics and physics. Level 1 addresses the 
detection, state estimation, and tracking of individual objects [17]; whereas level 2 broadens 
the scope to tracking groups of objects [22] and to the general assessment of multiple-object 
situations [7J. In data fusion, however, the overriding problem is how to achieve cohesion 
between the various levels [32J. Achieving such cohesion would be a valuable contribution 
to network science. 

This paper addresses a specific network problem from the data fusion perspective. Over 
the past decade, there has been a great deal of work on the community detection problem [28J : 
discerning how a graph's nodes are organized into "communities." There is no universally 
accepted definition of community structure: it can correspond to some unobserved, ground- 
truth organizational structure; it can refer to some attribute that nodes share that drives 
them to "flock" together [52]; or communities can be defined as sets of nodes more densely 
connected to each other than to the rest of the graph. Whatever the definition of commu- 
nity structure, it nearly always results in communities being densely connected subsets of 
nodes (the Newman-Leicht algorithm being a notable exception [57]). In practice, studies 
of community structure in graphs (e.g., [15]) define a community to be, in effect, the output 
of a community detection algorithm. Weighted and/or directed edges are allowed in some 
methods, but accounting for more general features on nodes and/or edges is problematic for 
network research because this information tends to be domain-specific. 

Community detection is nearly always formulated in terms an algorithm which ingests a 
network and outputs some indication of its community structure. With a few exceptions, 
community detection algorithms produce a single, hard-call result. Most often this result is a 
partition of nodes into non-overlapping communities, but a few algorithms produce overlap- 
ping communities (e.g., CFinder [HO]), and some produce a dendrogram — i.e., a hierarchy of 
partitions [SB] • The dominant framework for finding the best partition of nodes is to specify 
some quality function of a partition relative to a graph and seek to maximize it. Methods 
that maximize modularity [56J (explicitly or implicitly) are among the most numerous and 
successful today. 

From a data fusion perspective, however, it is important to assess the uncertainty associ- 
ated with community detection. Quality functions such as modularity are only motivated by 
intuition or physical analogy, whereas probability is the language of logical reasoning about 
uncertainty [38J. The reason principled fusion of disparate data types is possible is that 
one can posit an underlying model for reality, along with measurement models that specify 
the statistics of how this reality is distorted in the data. One can then update one's prior 
distribution on reality to a posterior via Bayesian inference [75] . 

There are some methods that formulate community detection as an inference problem: 
a prior distribution over all possible community structures is specified, along with a likeli- 
hood function for observing a graph given a community structure. Hastings, for example, 
formulated the community detection problem in terms of a Potts model that defines the 
Hamiltonian H for a given graph-partition pair, and then converted this to a probability 
proportional to e~ l3H [33]. Minimizing H therefore yields the MAP (Maximum A posteriori 
Probability) partition for given structural parameters of the Potts model. Hofman and Wig- 
gins extended this approach by integrating the structural parameters against a prior [31]. In 
both cases, if all one does with the posterior probability distribution is locate its maximum, 
then it becomes, in effect, just another quality function (albeit a principled one). On the 
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other hand, the entire posterior distribution is vast, so one cannot simply return the whole 
thing. The question, then, is what such a probability distribution is good for. 

Clauset et al. made greater use of the posterior distribution by devising a Monte Carlo 
method for generating dendrograms, and using it to estimate the probabilities of missing 
links [12]. Reichardt and Bornholdt employed a similar Monte Carlo method to estimate the 
pairwise co-membership probabilities p^ v < w ^ between nodes [611 E2] , where defined 
to be the probability that nodes v and w are in the same community. The set of all p^ v ' w ^ 
is much smaller than the full posterior distribution, and thus provides a useful, if incom- 
plete, summary of the uncertainty information. It is expensive to compute exactly, however. 
Therefore we will derive an accurate approximation with which to summarize uncertainty 
information for community structure more efficiently than Monte Carlo methods. 

A key benefit of retaining uncertainty information is that it enables principled track- 
ing [70] . We may track time- varying communities in time- varying graph data by deriving an 
efficient Bayesian filter for tracking time- varying communities from time- varying graph data. 
The term "filter" is somewhat strange in this context: the original, signal-processing context 
of filters (e.g., the Wiener filter [78]) was that of algorithms which filter out noise in order 
to highlight a desired signal. The Kalman filter changed this framework to one of distinct 
state and measurement spaces |40j. This was soon generalized to the concept of a Bayesian 
filter [31]. To develop a Bayesian filter, one constructs (a) an evolution model over a state 
space that specifies the probability distribution of the state at some future time given its 
value at the current time, and (b) a measurement model that specifies the probability distri- 
bution of the current measurement given the current state. Thus, despite the connotations 
of the word "filter," a Bayesian filter can have quite different state and measurements spaces. 
To track communities, a model for the co-evolution of graphs and community structure will 
be constructed, and the measurement model will be that only the graph component of the 
state is observable. 

In Section [2] we derive exact inference equations for the posterior probabilities of all 
possible community structures for a given graph. This result is essentially the same as can 
be found elsewhere (e.g., [331 [3H HI] ) , but is included here in order to introduce notation and 
clarify subsequent material. In Section [3] we derive an approximation of the co-membership 
probabilities p^ v ' w ^ based on using only the most important information from the graph. The 
pi v M matrix provides the uncertainty information that the usual hard-call algorithms lack. 
In Section [4] we demonstrate that the p^ v ' w ^ approximation is accurate and also surprisingly 
efficient: despite the fact that it provides so much information, it is significantly faster than 
the current, state-of-the-art community detection algorithm (Infomap [Ml S3])- We also 
demonstrate the uses for this alternative or supplemental form of community detection, which 
are embodied in the software IGNITE (Inter-Group Network Inference and Tracking Engine). 
One benefit of maintaining uncertainty information is that it allows principled tracking. In 
Section [5] we present a continuous-time Markov process model for the time-evolution of both 
the community structure and the graph. We then derive an exact Bayesian filter for this 
model. The state space for this model is far too large to use in practice, so in Section [6] 
we discuss efficient approximations for the exact filter. The community tracking material is 
less developed than the corresponding detection material: there are several issues that must 
be resolved to develop accurate, efficient tracking algorithms. However, we believe that the 
principled uncertainty management of the data fusion approach provides a framework for 
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the development of more reliable, robust community tracking methods. 

2 Community Detection: Exact Equations 

Suppose that out of the space K of all possible networks on n nodes we are given some 
particular network ft. If we have some notion of a space $ of all possible "community 
structures" on these n nodes, then presumably the network k provides some information 
about which structures are plausible. One way to formalize this notion is to stipulate a 
quality function Q : K x <E» — > K that assigns a number to every network-structure pair 
(ft, <fi). It would be natural, for example, to define quality as a sum over all node pairs {v, w} 
of some metric for how well the network and community structure agree at {v, w}. That is, in 
the network k, if {v, w} is a link (or a "strong" link, or a particular kind of link, depending 
on what we mean by "network"), then it should be rewarded if <ft places v and w in the 
same community (or "nearby" in community space, or in communities consistent with the 
observed link type, depending on what we mean by "community structure"). Modularity is 
a popular, successful example of a quality function [56]. Quality functions are easy to work 
with and can be readily adapted to novel scenarios. However, the price of this flexibility 
is that unless one is guided by some additional structure or principle, the choice of quality 
function is essentially ad hoc. In addition, the output of a quality function is a number that 
provides nothing beyond an ordering of the community structures in $. The "quality" itself 
has little meaning. 

One way to give quality functions additional meaning is to let them represent an energy. 
In this case, the quality function may be interpreted as a Hamiltonian. The qualities assigned 
to various community structures are no longer arbitrary scores in this case: meaningful 
probabilities can be assigned to community structures can be computed from their energies. 
The language of statistical physics reflects the dominance of that field in network science [28J , 
but from a fusion standpoint it is more natural to dispense with Hamiltonians and work 
directly with the probabilities. A probabilistic framework requires models: these necessarily 
oversimplify real-world phenomena, and one could argue that specifying a model is just 
as arbitrary as specifying a quality function directly. However, the space of probabilistic 
models is much more constrained than the space of quality functions, and, more importantly, 
formulating the problem in terms of a formal probability structure allows for the meaningful 
management of uncertainty. For this reason, modularity and other quality functions tend 
to be re-cast in terms of a probability model when possible. For example, the modularity 
function of Newman and Girven [56] was generalized and re-cast as the Hamiltonian of a 
Potts model by Reichardt and Bornholdt [62], while Hastings demonstrated that this is 
essentially equivalent to inference (i.e., the direct manipulation of probability) [33J. 

A probabilistic framework for this community structure problem involves random vari- 
ables K for the graph and $ for the community structure. We require models for the prior 
probabilities Pr($ = 0) for all G $ and for the conditional probability Pr(K = ft|$ = <f>) for 
all ft G K and <fi G (We will typically use less formal notation such as Pr(0) and Pr(ft|</>) 
when convenient.) Bayes' theorem then provides the probability Pr($ = <p\K = ft) of the 
community structure <fi given the graph data ft. The models Pr(0) and Pr(«|0) typically 
have unknown input parameters jt, so that the probability given by Bayes' theorem could 
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be written Pr(</>|/«, //). This must be multiplied by some prior probability Pr(/2) over the 
parameter space and integrated out to truly give Pr(0|ft) [31]. A simpler, but non-rigorous, 
alternative to integrating the input parameters against a prior is to estimate them from the 
data. This can be accurate when they are strongly determined by the graph data: i.e., when 
Pr(/7|ft) is tightly peaked. The issue of integrating out input parameters will be addressed 
in Section |3j but for now we will not include them in the notation. 

Section 2.1 will derive Pr(«|0) using a stochastic blockmodel [20] with multiple link types 
for k. In Section 2.2, this will be simplified to the special case of a 'planted partition [13] 
model in which links are only "on" or "off." 



2.1 Stochastic blockmodel case 

Let m denote the number of communities ; and r be the number of edge types. The notation [p] 
will denote the set of integers {1,2, ... ,p}, and [p} denote the zero-indexed set {0, 1, ... ,p — 
1}. We will let [n] denote the set of nodes; [m], the set of communities; and [r] , the set of 
edge types. Let denote the set of (unordered) pairs of a set S so that [n]^ denotes the 
set of node pairs. It is convenient to consider [n]^ to be the set of edges: because there are 
an arbitrary number of edge types r, one of them (type k = 0) can be considered "white" or 
"off." Thus, all graphs have N = n{n — l)/2 edges, but in sparse graphs most of these are 
the trivial type k = 0. 

The community structure will be specified by a community assignment (j) : [n] — > [m], 
i.e., a function that maps every node v G [n] to a community <p(v) G [m]. The graph will 
be specified as a function k : [n]' 2 ' — > [r] , which maps every edge e G [n]' 2 ' to its type 
n(e) G [r] . (This unusual notation n will be replaced with the more usual G when dealing 
with the r = 2 case: i.e., when there is only edge type aside from "off.") 

The stochastic blockmodel H(n, p, Q) is parametrized by the the number of nodes n, the 
stochastic m- vector p, and a collection Q of stochastic r- vectors q^- [JI]. Here "stochastic 
m- vector" simply means a vector of length m whose components are non-negative and sum to 
one. The vector p comprises the prior probabilities Pi of a node belonging to the community 
i G [m] — the communities for each node are drawn independently. For 1 < i < j < m, the 
vector qij comprises the probabilities of an edge between nodes in communities i and j 
being of type k — the types of each edge are drawn independently once the communities of 
the nodes are given. (For i > j, let = q^: i.e., the edges are undirected.) The model 
H(n, p, Q) defines the random variables $ and K whose instances are denoted and k, 
respectively. The derivation of Pr($ = 0|K = k) proceeds in six steps. 

Step 1. The probability that a node v belongs to the community <f)(v) is, by definition, 

PT($(y)=<f>(v))= P(Kv) . (2.1) 

Step 2. The probability that an instance of $ is the community assignment equals 

/i 

Pr($ = 0) = n^(,) (2-2) 
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because the communities of each node are selected independently. 

Step 3. For a fixed value of the probability that the edge e = {v,w} has type n(e) is, 
by definition, 

Pr(K(e) = «(e)|$ = 0) = g<^),</,(«,), K ( e )- (2.3) 

Step 4. For a fixed value of of $, the probability that an instance of K is the graph k 
equals 

Pr(K = K|$ = 0) = JJ g^( ei )^(e 2 ), K (e), (2.4) 
ee[n]{2} 

because the types of each edge are selected independently given 0. 
Step 5. The probability of a specific assignment and graph k equals 

n 

Pr($ = 0,K = K) = X\p<t>(v) JJ g*(ei),*(ea),K(e), (2-5) 
f=l eG[n]{2> 

because Pr(0, «) = Pr(/t|0)Pr(0). 

Step 6. Finally, the posterior probability of $ = for a given graph k is 

Pr($ = 0|K = k) oc Pr($ = 0, K = k), (2.6) 
where the constant of proportionality is l/Pr(K = k). 

2.2 Planted partition case 

In many applications one does not have any a priori knowledge about specific communities. 
In such cases, the community labels [to] are arbitrary: the problem would be unchanged if 
the communities were labeled according to another permutation of [to]. Thus, if one has a 
prior distribution over p and Q (as in |34j), then that distribution must be invariant under 
permutations of [to]. In the case of fixed input parameters p and Q, this translates to 
p and Q themselves being invariant under permutations. Making this simplification, and 
considering only r = 2 edge types ("off" (k = 0) and "on" {k = 1)) yields the special case 
called the planted partition model [H]. In this case, symmetry implies that pi = 1/m for 
all i G [to], and that q^i = pi for i — j and g^i = po for i ^ j. Here pi denotes the edge 
probability between nodes in the same community, and po, the edge probability between 
nodes in different communities. Thus, the to + to(to + l)(r — l)/2 input parameters of 
H(n, p, Q) reduce to only four to give the planted partition model H(n, m,p I ,po). 

Having only two edge types suggests using the standard notation G to denote a graph, 
with E(G) denoting the set of ("on") edges. The symmetry of the community labels implies 
that Pr(0|/«) is invariant under permutations of [to], so that is more efficient to formulate 
the problem in terms of a partition n of the nodes into communities rather than (because 
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partitions are orbits of community assignments under permutations of [m]). We may then 



replace (2.5) by 

Pr(n,G) = -PiT'Poa -PoY°. (2.7) 

Here \ir\ denotes the number of (non-empty) communities in the partition tt, and (m)k denotes 
the falling factorial m!/(m — A;)!, which counts the number of assignments <fi represented by 
the equivalence class tt. The number of edges between nodes in the same community is 



denoted ej(G) (abbreviated to ej in (2.7)), and the number of non-edges (or "off" edges) 
between nodes in the same community is denoted ei{G). The analogous quantities for 
nodes in different communities are eo{G) and e~o{G). The posterior probability Pr(7r|G) is 
proportional to Pr(7r,G). 

3 Community Detection: Approximate Methods 

Community detection methods that employ quality functions return hard calls: an optimiza- 
tion routine is applied to determine the community structure that maximizes the quality 
f(K, 0) over all G $ for a given graph k. There is little else one can do with a quality 
function: one can return an ordered list of the fc-best results, but a probability framework 
is required to interpret the relative likelihoods of these. 



In contrast, the formulas (2.5) and (2.7) provide the information necessary to answer 
any statistical question about the community structure implied by k. Unfortunately, an 
algorithm that simply returns the full distribution is grossly impractical. The number of 
partitions of n nodes is the Bell number B(n), which grows exponentially with n: e.g., 
5(60) ~ 10 60 . What, then, are these probabilities good for? One answer is that the formula 
for posterior probability can be used as a (more principled) quality function |33| . Another 
is that Monte Carlo methods can be used to produce a random sample of solutions [HU [12] . 
These random samples can be used to approximate statistics of In this section we will 
consider how such statistics might be computed directly. 

3.1 Stochastic blockmodel 

The most natural statistical question to ask is this: what is the probability that a node v is 
in community il We may express this probability as p\ = Pr($(t>) = i\K = k), where the 
dependence on the graph n is suppressed from the notation. For the model H(n, p, Q), we 



may compute p\ from (2.5): 



Pt(k) 

<j}{v) — i 



^2 II ?*(ei),*(ea),K(e)- (3-1) 



-* <— ee[n]{2> 



Unfortunately, this exact expression does not appear to simplify in any significant way. 
(Ironically, its dynamic counterpart does simplify: cf. Section |6j) 

A strategy for approximating p\ is to use only the most relevant information in the 
graph. For example, we could divide the edges into two classes: those that contain v and 
those that do not. Edges in the former class have more direct relevance to the question of 



8 



which community v belongs to. If we let k v denote the restriction of the graph k to edges 
containing v, and K v be the corresponding random variable, then we may approximate p\ 
by p\ = Pr($(f ) = i\K v = k v ). By Bayesian inversion this equals 

PI oc Pr($(f) = z)Pr(K 1) = k v \$(v) = i) 



= Pi Y[Pr(K({v,x}) = k({v,x})\$(v) = i) = Pi JJ zJ^'^>({^})- 

x^v x^v j=l 

This equation exploits the statistical distribution of edge types that tend to emanate from 



a given community: if Q is such that this information is distinctive, then (3.2) will perform 
well. However, because it assesses each node in isolation, it does not exploit network structure 
and will not perform well when Q fails to produce distinctive edge-type distributions. 

If there were multiple, conditionally independent graph snapshots for a given ground- 



truth 0, then one could replace pi with p v ~ in (3.2), and pj with to get an updated 
value P1 + . One could initialize these values p v ~ to the prior p i and apply the update equation 
for each graph snapshot k: this would introduce communication between the results for 
individual nodes and thus exploit network structure. The approach in Section [6] is a more 
sophisticated version of this, which allows the temporal sequence of graphs to be correlated 
and nodes to move between communities. 

To derive useful probabilistic information that exploits network structure rather than 
just the statistical characteristics of edge-type distributions we turn to the second-order 
statistics p™ = Pr($(t>) = i,<&(w) = j\K = k). To approximate this, we may divide 
the edges into three classes: the edge {v,w}, the edges containing either v or w (but not 
both), and the edges containing neither. One gets a rather trivial approximation using only 
the single edge {v,w}, but using edges from the first two classes yields the approximation 
~vw ^_ Yx(§{v) = i,<&(w) = j\K v = K V ,K W = k w ). This quantity has a formula similar 



to (3.2): 



#7 ex Pr($(«) = z, $0) = j')Pr(K„ = Kv , K w = k w \$(v) = i, $ W = j) 
= piPjPv(K({v, w}) = k({v, w})\$(v) = i, = j)x 

Yl Pr(K({v,x}) = k({v,x}),K({w,x}) = k({w, x})\^(v) =i,$(w) = j) ^ ^ 



x=£v,w 



PiPjQij,K({v,w}) J J ^ PkQik,K({v,x})Qjk,K({w,a;}) ■ 
x^v,vi k=l 

(The version that uses only the single edge {v, w} as evidence is given by omitting the final 



product in (3.3).) This formula provides important statistical information even when Q is 
completely symmetric. Indeed, to exploit p™ it is simpler to work with the symmetric case. 

3.2 Planted partition model 



When p and Q are symmetric under permutations of [m], then (3.1) reduces to pi = 1/m 



(and (3.2) to p\ = 1/m). This is because in the symmetric case H(n,m,pj,po) community 



labels have no meaning, so first-order statistics become trivial. The simplest, non-trivial 
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quantities to compute are the second-order statistics p™ . In the symmetric case, they 
reduce to the single probability p^ v < w ^ that v and w are in the same community: i.e., p{ v > w * = 
Pr($(f ) = $(w)|K = k). To compute p^ v ' w ^ exactly requires a summation over all partitions. 
Reichardt and Bornholdt estimated the p^ v ' w ^ matrix by a Monte Carlo sampling of the 
partition space, but this is slow [S2]- Instead of this, we may approximate p{ v > w } directly by 



simplifying (3.3). This leads to fairly simple expressions. The meaning of these expressions is 
opaque, however, when derived through straightforward mathematical manipulations, which 
creates problems when trying to adapt the results to engineering contexts. Therefore we 
proceed along more general lines to demonstrate which aspects of the partition-graph model 
lead to which aspects of the resulting expressions. 

Suppose instances of some random process V are partition-graph pairs (tt, G) on n nodes. 
This process is not necessarily H(n,m,pi,po): we will later take V to be a somewhat 
more complex process in which the parameters m, pj, and po are first drawn from some 
distribution, and then an instance of H(n,m,pi,po) is generated. Let M vw be the indicator 
random variable for the event that v and w are in the same community (i.e., M vw = 1 when 
v and w are in the same community, and otherwise), and K vw be the indicator random 
variable for the existence of an edge between v and w. Now let k vw indicate the presence 
or absence of the edge {v,w} in some given graph G (i.e., k vw = 1 if {v,w} is an edge of 
G, and otherwise). Thus, the k vw are data, rather than instances of K vw . We define J vw 
to be the indicator random variable for K vw agreeing with this datum k vw (i.e., J vw = 1 if 
K vw = k vw , and otherwise. We may express J vw as 

Jvw 1 I^VW ( 1) K !•»/)• (3.4) 

Now let E, vw be the indicator random variable for V agreeing exactly with G on all edges 
containing v and/or w. We may express this as 

1 — "vw Jvw J"J JvxJwx' (3'5) 

The approximation to p^ v ' w ^ based on using only local graph information may then be 
written p^ v ' w ^ = Pr(M vw = 1\E VW = 1) = EfMy^lH^ = 1]. This can be expressed as 

~ {ViW} = Pt(E vw = 1\M VW = l)Pr(M vw = 1) = 

P ' Pr(5 w = 1) A vw + E[M W ]-! -V 1 ' j 

where the likelihood ratio A vw is given by 

A ^[^VW I M vw 1] . . 

A ™ ~ e[e vw \m vw = oy (6J) 



To evaluate E[5^|M UU) ] we would like to use (3.5), requiring that V have suitably fa- 
vorable properties. If V — H(n,m,pi,po), then the random variables J vw , and each of the 
JvxJwx for x 7^ v,w are conditionally independent given M vw . E.g., if M vw = 1 (i.e., v and 
w are in the same community), then K vw = 1 with probability pj, independent of the values 
of any other K xy . However, if V is a process in which a parameter vector p, is first drawn 
from some distribution, and then a draw is made from some process V(p), then assumption 
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of conditional independence is far too restrictive. In such a case the existence of many edges 
elsewhere in the graph would suggest a large value of a parameter like pj, and hence a larger 
value of K vw , so this random variable would not be conditionally independent of the other 
K xy given M vw . 

This problem is easily overcome, however. We simply decompose the expected value into 
the conditional expectation for a specified value of ft, followed by an expectation over ft. 
E.g., we write E[5 W |M W ] as 

E[E VW \M VW ] = E p [E[E vw \M vw ,jt}}. (3.8) 

We then stipulate that J vw and each of the J VX J WX for x ^ v , w are conditionally independent 
given M vw and ft. Then 

E[E VW \M VW , ft\ =E[J VW \M VW , ft] ]J[ E[J VX J WX \M VW , ft\. (3.9) 

We may express the factors in the product in terms of a covariance: 

E[J VX J WX \M VW , ft] — E[J vx \M vw ,jt]E[J wx \M vw ,ft] + Cov(J vx , J WX \M VW , ft). (3.10) 
We make the further assumption that K vx is conditionally independent of M vw given ft 



(which, again, holds for H(n,m,pi,po))- Then, using (3.4) we have 

E[J VX J WX \M VW , ft] = E[J vx \ft\E[J wx \ft\ + (-l) K ™+^Cov(K vx ,K wx \M vw ,ft). (3.11) 

We introduce the following notation 

fi = E[M vw \ftl V + = Cov(K vx ,K wx \M vw = I, ft), (3.12) 
6 = E[K vx \fl], V" = Cov(K vx ,K wx \M vw = 0,/x). (3.13) 

In this symmetric scenario all quantities are invariant under node permutations. Thus /i 
is the probability that two randomly chosen nodes are in the same community (for fixed 
parameters ft), and 5 is the probability that two random chosen nodes have an edge between 
them. We write E[J VX J WX \M VW , ft] in terms of these quantities: 

{(l-5) 2 + 4> Hk vx + k wx = 0, 
5(1 — 5)- ip if k vx + k wx = 1, (3-14) 
5 2 + ip if k vx + k wx = 2, 



where 



v>=r + ifM ^ =0 ' (3.i5) 



We may use this to express (3.9) as 



E[E vw \M vw ,fI}=E[J vw \M vw ,fI}f{5,^), (3.16) 

where 

f(5, J,) = ((1 - 5f + ^) no (6(1 - 6) - (5 2 + ^) n \ (3.17) 
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Here rij denotes the number of nodes (aside from v and w) adjacent to exactly j of {v,w}, 
and n + n\ + n 2 = n — 2. 

Now to compute E[S^|M„u,] we substitute (3.16) into (3.8). To evaluate the expectation 



E^[-] of (3.16) requires a specific random graph model V. We will use the following V: we 
will select the number of communities m in a manner to be discussed below, and select pj 
and po uniformly from < po < Pi < 1. Then we shall make a draw from H(n,m,pj,po) 
to generate a partition-graph pair (it, G). For this model we have 



as well as 



if 



/i = 1/m, and 5 = \ipi + (1 — /i)po, 



(<* ~Po)(Pi -S) = M 1 - v){Pi-PoY > 0, and 



-/i 2 (P/-Po) 2 <0. 



Finally, the leading factor in (3.16) is 



E[J VW \M VW = l,/7] — pi if k vw 
E[J VW \M VW = 0,/J] = po if «wo 



E[J TO |M„ 



1) A*] 
0,/i] 



1 - p! if /t r 
1 - po if K t 



0. 
0. 



(3.18) 



(3.19) 
(3.20) 



(3.21) 
(3.22) 



We may split the expectation E^[-] into an integral over pj and po followed by an expec- 



tation with respect to m. Then (|3.8|) becomes 
E [^^tij | ^M vw 



E r 



o 



VI 



E[J vw \M vw ,jl]f(8,ip) dpodpi 



(3.23) 



We may change coordinates from (j>o,Pi) to (5, for M vw = 1 and to (5, for M vw = 0. 
This introduces complications due to Jacobians and complicated regions of integration R + 
and R~ , but it is helpful to be in the natural coordinate system of /: 



E[E VW \M VW = 1] = Er. 
E[E VW \M VW = 0] = E r . 



E[J VW \M VW = 



R+ y/fj,(l ~ + 

E[J VW \M VW = 0,/2] 

R- 



f(8,ij + )d6dij + 
f(6,i/j-)d6dip- 



(3.24) 
(3.25) 



In the M vw = 1 case, the range < po < Pi < 1 is transformed into the following region R + : 
ip + = to 5 2 {1— n) / n for 5 = to ji and ip + = to (1 — 5) 2 ji/{l— n) for 5 = jj, to 1. Similarly, 
in the M vw = case it is transformed into the following region R~: ip~ = —5 2 to for 

■5) 2 (/i/(l-/i)) 2 to for 5 = ii to 1. To compute (f^24|) and fl3~25]) 



1 



5 = to and -0 = 

numerically one would transform the expressions (3.21 ) and (3.22 ) into (5, ip) space, although 



it seems to be more numerically stable to use the expressions (3.19) and (3.20) in (3.23) 



For small n, this numerical integration is feasible. The following example employs numerical 
integration for a dataset with n = 34 nodes. 

Figure [l] shows which pairs of nodes are particularly likely or unlikely to be in the same 
community for Zachary's karate club data [79]. This is a social network of 34 members of 
a karate club at a university which split into two communities. Nodes 4 and 8 are most 
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Figure 1: p {v ' w} plot for Zachary's karate club: blue: > 60%; red: p^ < 1.8% 

likely, with p^' 8 ^ = 98.8%, and nodes 1 and 34 least likely to be in the same community with 
p{!>34} _ o.65%. Being adjacent does not guarantee a high value of p^ v ' w ^\ the node pairs 
{1, 32} and {14, 34} each have p {v ' w} = 8.9%. Nor is it necessary for nodes to be adjacent to 
have a high value of p{ v > w }-. among nodes 15, 16, 19, 21, and 23 p{ v > w } = 84.5% for every pair, 
and p^ 8 ' 14 } = 96.1%. Note that the node pairs {8, 14} and {1, 34} are each non-adjacent, and 
each has four common neighbors (i.e., n 2 = 4), but their values of p^ v ' w ^ differ by a factor 
of 150 because of the degrees of the nodes involved. Finally, although nodes 9 and 31 are in 
different ground-truth communities, p^ 9 ' 31 ^ = 92.1%. 

When numerical integration is not feasible, it is difficult to obtain good asymptotic 
estimates as n — > 00, so we will resort to heuristics. The function / has a global maximum 
which is increasingly sharp as n — > 00. This occurs at 

n 1 + 2n 2 . An n 2 - n\ 

S '=2^)' ^ = 4(n - W • ( } 

If > 0, then this peak lies within R + when ip p / (ip p +(l—5 p ) 2 ) < ji < 8 p / {jp p +5 p ). Assuming 
the expectation E m [-] has significant weight in this range, one can replace / with a delta 
function at (5 P , tp p ) to estimate EfH^lMy^, = 1] (i.e., parameter estimation is an appropriate 
approximation to the full, Bayesian integration). To estimate EfH^lM^ = 0] in this case, 
one could make the same argument, but with the maximum of f(5,tp~) constrained to 
ip~ < 0. Conveniently, this constrained maximum occurs at (5 P , 0). The analogous argument 
works when tp p < 0. Therefore, if the integrals in ( |3.24[ ) and ( |3.25 ) contained nothing but 



f(S,ijj) we could approximate them by f(S p ,^ p ) or f(5 p ,0) as appropriate. Ignoring the 



expectation over m as well, we could substitute these expressions into (3.7) to obtain the 
following approximation 

X „ {fQpM/W P ,0) if^ P >0, and 
™~ 1/(M)//(<Wp) if^ P <0. 
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We may decompose A as A = AA&, where Ai encompasses all corrections to the crude 
approximation A when there is an edge between v and w, and A when there is no edge. 
For this, we need to specify the prior on m. Here, we let logm vary uniformly from log 2 to 
log n. It is convenient to treat m (or, equivalently, /i) as a continuum variable here to avoid 
the accidents of discreteness. With this prior, we may compute the value of E[My W ] required 



in (3.6) as 



P = E[M_] = \ /2 (3.28) 
log(n/2) 

When ip p differs greatly from 0, the A factor is very large or small and dominates the 
correction term A/-. Therefore we seek to approximate the correction factor only in the 
critical case ip p = 0. Typically, real-world graphs are sparse, in which case the lack of 
an edge between v and w decreases their co-membership probability only slightly, but the 
presence of an edge greatly enhances it. Numerical experimentation confirms this intuition: 
the correction factor A due to the absence of an edge is roughly constant, but the factor 
A x due to the presence of the edge {v, w} increases rapidly as 5 P decreases until it hits a 
constant plateau (which varies with n): 

A = min(0.7197, 0.46 5~ 015 ), and (3.29) 
Ai =min(0.5605n + 1.598, 5~ - 7 ). (3.30) 

The four-digit coefficients in these formulas are obtained from an asymptotic analysis of 
exact results obtained in the n\ = = case. These exact results involve combinations of 
generalized hypergeometric functions (i.e., p F q (a; b; z)), and are not particularly enlightening, 
although they can be used to obtain accurate coefficients, such as 0.56051044368284805729 



rather than 0.5605 in (3.30). 



Putting the above together into (3.6), we obtain the following approximation to p 



{v,w} . 



p{v,w} _ ^~ Kvw ^~ (3 31) 

A K A + R- 1 - 1 

This formula could certainly be improved. It often yields results such as p{ v > w } = 1 — 10~ 20 : 
this figure might be accurate given the model assumptions, but such certainty could never 
be attained in the real world. To make it more accurate a more sophisticated model could be 
used, or the priors on pi, po, and m could be matched more closely to reality. Only limited 
improvement is possible, however, because in reality multiple, overlapping, fuzzily-defined 
community structures typically exist at various scales, and it is unclear what p^ v ' w ^ means in 
such a context. Certainly the integral approximations could be performed more rigorously 
and accurately. The broad outlines of the behavior of p^ v ' w ^ are captured in A, A , and 
Ai, however. Finally, using only local evidence constitutes a rather radical pruning of the 



information in G. However, it is because of this pruning that the approximation (3.31) can 
be implemented so efficiently. 

4 Community Detection: Results 

Direct visualizations like Figure [T] are impractical for larger networks. It can be useful to 
use the blue edges in Figure 111 (those with above a certain threshold) in place of a 
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Figure 2: Visualization of p^-™} matrix for the Enron email network 



graph's edges in network algorithms (such as graph layout): this is discussed in Section 4.3 



However, a more direct use of the co-membership matrix pv ) ' w s for network visualization is 
simply to plot the matrix itself with the values of p^ v ' w ^ e [0, 1] as intensities [61, 62J. An 
example of this is shown in Figure [2j using the approximation (3.31), for the Enron email 
communication network, which has 36,692 nodes and 367,662 edges [48J. The insets depict 
the hierarchical organization of community structure in networks [121 01] : communities with 
various structures exist at all scales. Although the model H(n,m,pi,po) does not account 
for hierarchical structure, a benefit of integrating over the number of communities m (rather 
than estimating it) is that this accounts for co-membership at different scales. 



4.1 Accuracy 

One may rightly question whether the approximation accurate, given the modeling 

assumptions and approximations that it is based on. To address this, we observe that the 
values of p^ v,w ^ may be used to define a certain family expected utility functions (parameter- 



ized by a threshold probability 9: cf. (A. 14) in Appendix |Aj), and optimizing this expected 
utility yields a traditional community detection algorithm. Because a great many community 
detection algorithms have been developed, one can assess the quality of the approximation 
p{v,w} ^ p{v,w} com p ar i n g the performance of the resulting community detection algorithm 
to those in the literature. 

The most comprehensive comparison to date is based on the LFR benchmark graphs 
which have power-law distributions both on degree and on community size [S] . The conclu- 
sion is that all algorithms prior to 2008 were eclipsed by a set of three more recent algorithms: 
the Louvain algorithm [8j, the Ronhovde-Nussinov (RN) algorithm [63J, and Infomap [64J. 
Infomap performed somewhat better than RN, and both somewhat better than Louvain, but 
all three were much better than the previous generation. Figure [3] compares our algorithm 
to the Infomap, RN, and Louvain algorithms, and to the other algorithms tested in [43J. 
(This figure is a correction of Figure 3 from [27J. Also, the Simulated Annealing method 
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H fx 

(a) 1000 nodes, small communities (b) 1000 nodes, large communities 




0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 

H fx 

(c) 5000 nodes, small communities (d) 5000 nodes, large communities 



Figure 3: Comparison of community detection algorithms: •: U op t] ♦: Infomap; : RN; 
A: Louvain; — : Other methods. 



which works so well in panel (b) is highlighted (purple with shorter dashes) in all panels for 
comparison.) Our method is labeled U op t because numerical optimization over all 9 G [0, 1] 
has been used to set 9 to the value that maximizes NMI. Both the 1000- and 5000-node cases 
are shown, for small communities (10 to 50 nodes) and large ones (20 to 100). The x-axis 
is the mixing parameter /i — the fraction of a node's neighbors outside its community {not 
the expected edge probability /i of (3.12)) — and the y-axis is a particular version (cf. the 
appendix of [33]) of the Normalized Mutual Information (NMI) between the computed and 
the true partition. In all cases, our method U op t exhibits the performance characteristic of 
the three state-of-the-art methods cited by [33]. The U opt method has an unfair advantage 
in optimizing over all 9: in a deployable algorithm one would need a method for setting 9. 
On the other hand, the purpose of Figure [3] is simply to show that the p{ v > w } computation 
retains enough information about community structure to reconstruct high-quality hard-call 
solutions. From this perspective, it is surprising that it does so well, because U opt is based 
on (a) the simple utility function of Appendix [XJ (b) an approximation p{ v ' w } to p^ v > w ^ based 
only on limited evidence, and (c) an approximation p* v ' w } to p^- v ' w ^ based on a heuristic 
evalution of the required integral. 
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4.2 Efficiency 

The algorithm for computing the p^ v,w ^ begins with pre-computing the value of n 2 for all 
pairs of nodes for which n 2 > 0, then creating a cache of p{ v > w s values for triples (k vw , ni, n 2 ). 
For any node pair {v,w}, the value of p^ v ' w ^ can be computed by first looking up its value 
of 712, computing uq and n\ from n and the degrees of v and w, then looking up the p^ v ' w ^ 
value for its triple (k vw , n 1; n 2 ). Occasionally the value for this triple must be computed 



from (3.31) and cached, but the number of such distinct triples is relatively small in practice. 
An optional additional step one can perform is to loop over all node pairs with non-zero n 2 in 
order to both fill in the value of p^ v ' w s for each triple (k vw , ni, 11*2), and count the number of 
times each triple occurs. (Because only the {v,w} pairs with n 2 > are looped over, some 
additional bookkeeping is needed to fill in and provide a count for the (k vw ,tii,0) triples 
without actually iterating over all 0{n 2 ) node pairs.) These values and counts are useful for 
the statistical analysis of the p{ v > w s distribution. 

We tested the algorithm on five different Facebook networks (gathered from various uni- 
versities) [H], and networks generated from Slashdot [19], Amazon [17], Live Journal [49J, 
and connections between . edu domains (Wb-edu) [19] . Table [l] shows various relevant net- 
work statistics. The sum of the values n 2 for each node pair is the number of calculations 
needed to compute the n 2 data structure, whereas the number of values of n 2 > reflects its 
size. The next column is the number of distinct (k vw , ni, n 2 ) triples — this is the number of 
distinct p^ v ' w ^ values that must be computed, and the final one is the number of communities 
that a randomly chosen instance of the algorithm Infomap [6l] found for the dataset. For 
the last two rows the n 2 data structure was too large to hold in memory, and the second 
step of counting the triples was not performed, nor could Infomap be run successfully on our 
desktop. 



Dataset 


Nodes 


Edges 


X>2 


#(n 2 > 0) 


Triples 


Groups 


Caltech 


769 


16,656 


1,231,412 


186,722 


14,120 


19 


Princeton 


6,596 


293,320 


46,139,701 


8,776,074 


83,004 


51 


Georgetown 


9,414 


425,638 


67,751,053 


15,616,610 


113,722 


90 


Oklahoma 


17,425 


892,528 


194,235,901 


47,202,925 


239,162 


233 


UNC 


18,163 


766,800 


140,796,299 


47,576,619 


191,482 


167 


Slashdot 


82,168 


504,230 


74,983,589 


49,450,449 


104,330 


5,209 


Amazon 


262,111 


899,792 


9,120,350 


6,434,638 


7,178 


12,851 


LiveJournal 


4,847,571 


42,851,237 


7,269,503,753 


4,193,393,006 






Wb-edu 


9,450,404 


46,236,105 


12,203,737,639 


4,232,928,806 







Table 1: Network datasets 



Table [2] contains timing results based on a Dell desktop with 8GB of RAM, and eight 
2.5GHz processors. The first column is the number of seconds it took the version of Infomap 
described in p3] to run. This code is in C++, runs single-threaded, includes a small amount 
of overhead for reading the network, and uses the Infomap default setting of picking the best 
result from ten individual Infomap trial partitions. The next four columns compare methods 
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of using our Java code to compute n 2 - The first two are single-threaded, and the other two 
use all eight cores. The columns labeled n 2 — )■ are the timing for the computation only: 
results are nulled out immediately after computing them. These columns are included for 
two reasons. First, they show that the computation itself displays good parallelization: the 
speedup is generally higher than 6.5 for eight processors. Second, the computation itself 
for the two larger datasets is quite fast (just under three minutes on eight cores), but the 
algorithm is currently designed only to maintain all results in RAM, and these datasets are 
too large for this. The last two columns are the timing results for explicitly filling in the 
pi v M information ahead of time and providing the counts required for statistical analysis. 

As Infomap is one of the fastest community detection algorithms available, these results 
are quite impressive. Comparing the first two columns of timing data, we find the speed-up 
ranges up to 107 and 862 times as fast as Infomap for the two largest networks on which we 
ran Infomap, respectively. The relative performance falls off rapidly for denser networks, but 
even in these cases or method performed roughly 10 times as fast as Infomap (i.e., as fast 
as an individual Infomap run). Computing the counts for p{ v > w } statistics increases the run 
time, but only by a constant factor (of 2 to 3) . It must be emphasized that the timing results 
are for producing a very different kind of output than Infomap does. However, the usual 
method for estimating p^ v < w ^ [62j is many times slower than producing partitions, rather 
than many times faster. 



Dataset 


Infomap 
10 trials 


n 2 
1 proc 


n 2 -> 
1 proc 


n 2 
8 proc 


n 2 -> 
8 proc 


Count 
1 proc 


Count 
8 proc 


Caltech 


1.0 


0.11 


0.09 


0.02 


0.02 


0.09 


0.03 


Princeton 


45.4 


4.5 


3.9 


0.67 


0.55 


3.9 


0.70 


Georgetown 


77.5 


7.2 


6.4 


1.2 


0.93 


7.0 


1.2 


Oklahoma 


310.8 


23.5 


19.4 


4.2 


2.8 


33.7 


4.6 


UNC 


446.6 


19.2 


16.3 


3.6 


2.5 


22.9 


4.0 


Slashdot 


1553.4 


14.5 


12.6 


3.0 


1.9 


21.8 


3.6 


Amazon 


2075.8 


2.4 


2.2 


0.59 


0.44 


3.2 


0.69 


LiveJournal 






1246.2 




172.9 






Wb-edu 






1243.0 




174.6 







Table 2: Timing results (in seconds) 



4.3 Uses 

Visualizing the p^ v ' w ^ matrix provides insight into the hierarchical community structure of a 
network [62J . To do so requires an appropriate ordering on the nodes: one that places nodes 
nearby when they are in the same small community, and small communities nearby when 
they are in the same larger community. Therefore, it is useful to have a dendrogram of hierar- 
chical community clustering. There are standard routines for producing such dendrograms, 
provided one has a distance metric between points [36]. The community non-co-membership 
probabilities pWW = 1 — p\ v M may be used for this purpose. They satisfy the triangle 
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(a) Dendrogram for node ordering (b) p^ v ' w ^ matrix 

Figure 4: Visualization of community structure for Princeton Facebook network 
inequality pWW < pWW (although distinct nodes have distance between them 



if they are known to be in the same community). The approximation (3.31) does not obey 
this triangle inequality, however, and [26J indicates that this can cause problems in certain 
contexts. An approximation of p^ v < w ^ which does obey it is a topic for future research. In 
the meantime, we will rely on this version, which tends to obey it for most node triples. 

To order the nodes of a graph we use this (approximate) distance metric pW-M with a 
hierarchical clustering scheme that defines cluster distance as the average distance between 
nodes. The output of this is a binary tree (ties having been broken arbitrarily), so the order 
of the clusters at each branch point must still be determined. This is done by starting at 
the root of the tree and testing which ordering at each branch point yields a smaller average 
distance to its neighboring clusters on the right and left. Figure [4] shows the resulting 
dendrogram for the Princeton Facebook network (cf. Table [l|, and the corresponding p^ v > w ^ 
matrix. (This method was used to generate the ordering in Figure [2] as well.) 

Visualizations like Figure [4] are useful for network analysis. We have combined them other 
visualizations in the code IGNITE (Inter-Group Network Inference and Tracking Engine). 
Figure [5] uses IGNITE to the Georgetown Facebook network (cf. Table [T]). The dendrogram 
on which the ordering for the pi v > w } matrix is based is shown in the upper left panel. Two 
levels in this dendrogram have been selected: the lower level is used to coarse-grain the 
network by merging communities of nodes together into meta-nodes; the upper level is used 
to determine which sets of meta-nodes to consider communities. The selection of these levels 
is reflected in the p^ v ' w ^ matrix panel below. The meta-nodes are indicated by translucent 
green squares, and communities of nodes are outlined in different colors (corresponding to 
similar outlines in the dendrogram). The meta-nodes and communities are then displayed in 
panels on the right: the upper panel corresponding to a coarse-grained version of the original 
network; the lower, to a variant where the edges have been replaced with averaged p^ v ' w s 
values between meta-nodes. The sizes of the meta-nodes indicates how many true nodes they 
contain. In the lower right panel, blue lines indicate average p{ v ' w * above a certain threshold, 
and red below another (much lower) threshold, as in Figure [TJ 
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Figure 5: Screen shot of IGNITE network probability visualization tool for Georgetown 
Facebook network 

5 Community Tracking: Exact Equations 

The study of dynamic networks is much less developed than its static counterpart. There is 
substantial work on processes evolving on networks. For example, see [18] for a discussion of 
the complex dynamical systems which arise in economics and traffic engineering, along with 
mathematically rigorous results about their equilibria. Diffusion equations on networks have 
particularly elegant properties. They are governed by the Laplacian matrix of a graph, the 
discrete analog of continuum Laplacian operator, and are therefore an important topic in 
spectral graph theory [51] . These may be generalized to reaction-diffusion equations and used 
to model the spread of disease [13] . but the more common model in network epidemiology 
is the SIR model [12]. Such models have been extended to model the spread of rumors [3], 
obesity [UJ, and innovations [75] . 

The term "dynamic networks" implies that the networks themselves are evolving in time, 
however. Stokman and Doreian edited several influential special editions of the Journal of 
Mathematical Sociology on network evolution, the first of which was in 1996 and published 
in book form as ^2T\. This work illustrated how macroscopic behavior of network evolution 
arises from local governing laws. Snijders emphasizes [68] the benefits of casting the dy- 
namic network problem in the continuous-time, Markov process framework first proposed by 
Leenders [16] . In particular, there is a small body of literature on communities evolving in 
dynamic networks. Much of this work is summarized in 3 1/2 pages of Fortunato's 100-page 
review of group finding [28]. The field begins with the 2004 work of Hopcroft et al. [35J, 
which studied the persistence of robust communities in the NEC CiteSeer database. The 
most prominent publication is the 2007 work of Palla et al. [59], which analyzed the evo- 
lution of overlapping groups in cell phone and co-authorship data and presented a method 
for tracking communities based on the clique percolation method used in CFinder [60] . The 
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various researchers in the community have come to agree on the key fundamental events 
of community evolution: birth/death, expansion/contraction, and merging/splitting |31j. 
Berger-Wolf and colleagues propose an optimality criterion for assigning time-evolving com- 
munity structure to a sequence of network snapshots, prove that it is NP-hard to find the 
optimal structure, and develop various approximation techniques [73l 1721 ITT]. 

Most of the work on community tracking considers discrete network snapshots and at- 
tempts to match up the community structure at different time steps. From the perspective 
of the data fusion community, such an approach to tracking may seem ad hoc: one could 
argue that (a) the "matching up" criteria are necessarily heuristic, and (b) one gets only a 
single best solution with no indication of the uncertainty. In contrast, the tracking work in 
data fusion is based on formal evolution and measurement models for the full probability 
distribution over some state space, followed by principled approximations [39J. A sensible 
response to this critique, however, is that the state space in the community tracking problem 
is so much larger that data-fusion- style tracking techniques do not apply. The truth is per- 
haps somewhere in between: it is, in fact, possible to derive a formal Bayesian filter for the 
community tracking problem and to produce tractable approximations to it. Indeed, this is 
the topic of the remainder of this paper. On the other hand, the filter derived is more a proof 
of concept than an algorithm ready to supplant the more informal methods. It may be that 
the formal approach we present here can be developed into a true, practical "Kalman filter 
for networks." On the other hand, it may be that concerns about uncertainty management 
can be addressed without appeal to a formal model. For example, Rosvall and Bergstrom 
have devised a re-sampling technique to estimate the degree to which the data support the 
various assignments of nodes to time-evolving communities [65] . Similarly, the work of Fenn 
et al. tracks the evolution of groups by gathering evidence that each node belongs to one of 
a number of known groups |23j , thus providing output similar to one version of the method 
outlined below. 

We model community and graph evolution as a continuous-time Markov process [16], 
{(3>t, K t ) : t > 0}, the continuous-time analog of a Markov chain. The continuous-time 
approach is more general (in that it can always be sampled at discrete times to produce 
a Markov chain) and is simpler to work with due to the sparsity of the matrices involved. 
We will not explicitly indicate any dependence on structural parameters /Z, because we 
will not integrate these out as was done in the static case. In this section we will use 
K[o,t] = {Kf : < t' < t} to denote the time-history of the network process up through time 
t, and K[ 0) t) = {K t / : < if < t} for the history not including the current time t (with similar 
definitions for $[o,t] and <&[o,t))- The purpose of this section is to derive a Bayesian filter 



Pr($ t |K[o 5 t]): i.e., assuming some initial distribution Pr($o) is given, Section 5.1 derives 
the expressions for evolving the distribution of the community structure to time t, given 
all network evidence up through time t. In the community detection case, the next step 
was to approximate marginals of the full distribution using limited graph evidence. In the 
tracking case, however, it is possible to obtain exact formulas for the marginals: this is done 
in Section [5\2 These formulas, though exact, are not closed, however: the approximations 
required to close them are discussed in Section [6j 
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Figure 6: An instance of H(12, 3, 0.5, 16, 4, 2, 18) 



5.1 Evolution of the full distribution 

A dynamic stochastic blockmodell-iin^ A, B) may be denned analogously to the static version 



H(n } p, Q) introduced in Section 2.1 Whereas H(n } p, Q) defines a pair of random variables 
$ and K, T-L(n, A, B) defines a pair of stochastic processes {$ 4 : t > 0} and {K t : t > 0}. The 
joint process {($t,K() : t > 0} will be modeled as a continuous-time Markov process. The 
parameter A is an m x m matrix, and B is a collection of r x r matrices B^ for 1 < i < j < in 
(and for convenience we define Bji = Bij for j > i). Just as p and the were required to 



be stochastic vectors (i.e., vectors with non-negative entries that sum to one) in Section 2.1 
so the matrices A and B^ are required to be transition rate matrices: i.e., they must have 
non-negative off-diagonal entries, and their columns must sum to zero. The entry ajj of A 
defines the transition rate of a node in group j switching to group i: i.e., the probability of 
a node in group j being in group i after an infinitesimal time At is + At + 0((At) 2 ). 
Similarly, the entry bij^i of B^ defines the rate that an edge connecting nodes in groups i 
and j transitions from edge type I to type k. 

We may define a dynamic planted partition model 7i(n,m,a, A/, fii, Ao, fio) as a special 



case of T-L(n, A, B). As in Section 2.2, this cases is obtained by requiring that A and B be 
invariant under permutations of [m] and using only r = 2 edge types ("off" (k = 0) and 
"on" (k = 1)). In this case, the transition rate matrix A reduces to a single rate a at which 
nodes jump between communities, while the collection B of transition rate matrices reduces 
to four rate parameters: A/, the rate at which edges turn on for pairs of nodes in the same 
community; fii, the rate at which edges turn off for pairs of nodes in the same community; 
and Ao and no, the corresponding rates for pairs of nodes in different communities. Figure [6] 
depicts and instance of this model with n = 12 nodes and m = 3 communities with rate 
parameters a = 0.5, A/ = 16, fij = 4, Ao = 2, and no = 18 for t = to 5. 

If two independent random variables X and Y have respective probabilities X{ and yj for 
their various outcomes i and j, then the joint random variable Z = (X, Y) has outcomes 
indexed by (z, j) with probabilities zn^ = Xiyj. The analog of this for Markov processes 
is expressed by the Kronecker sum [SB]- he., Suppose two independent Markov processes 
{X{t) : t > 0} and [Y(t) : t > 0} have respective probabilities Xiit) and yj{t) for their 
various outcomes i and j at time t, and that these probabilities are governed by x = Ax and 
y = -By, respectively (where x(t) collects all the Xi(t), and y(t), the yj(t)). Then the joint 
Markov process {Z(t) : t > 0}, where Z(t) = (X(t),Y(t)), has outcomes indexed by (i,j) 
with probabilities that are governed by z = (A © B)z (where z(t) collects all the 
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The Kronecker sum A © B is denned by 



(A © = CLu'Sj? + $u'bjj>. (5.1) 

The interpretation of this is that in an infinitesimal time the Markov process X t may tran- 
sition from i to another state i', or Y t may transition from j to f, but for both to change 
simultaneously is infinitely less likely than for only one to change. 

The derivation of the Bayesian filter for Pr($ t |K[ ^i) follows the same six steps as the 



static derivation in Section 2.1 Indeed, the purpose of including the six-step derivation in 



Section 2.1 was to make the following derivation easier to follow by analogy. 



Step 1. Let p v {t) G M m denote the vector of probabilities p^{t) that a single node v is in 
community i at time t: i.e., Pi(t) = Pr($ t (t>) = i). These probabilities are governed by the 
transition rate matrix A. Therefore 



dt 



Ap v , which has the solution p v {t') = e A{t ~ t)v {t >. (5.2) 



Step 2. Let V G W 71 " denote the vector of probabilities V^(t) that the communities of all n 
nodes are specified by the assignment <fi at time t: i.e., V^if) = Pr($ t = <p). The transition 
rate matrix for this joint process on all nodes is the Kronecker sum of the (identical) transition 
rate matrices for each node: 

dV 

— = AV, where „4 = @A (5.3) 

v=l 

The components A<p$ of A may be expressed as 

. n 

/.a^vMv), if </»' = </», 



Am a 



v=l 



apfr*)^*), if 4>'(v) = <f>(v) for all v ^ v* , 
k 0, otherwise. 



(5.4) 



Step 3. Let q e (t) G R N (where N = n(n — l)/2) denote the vector of probabilities q%{t) 
that a single edge e has type k at time t given the current communities of its endpoints: i.e., 
q%{t) = Pr(K t (e) = k\$ t (v), & t (w)). These probabilities are governed by the transition rate 
matrix B^, where i = 4>t{v) and j = <fit{w) are the current communities of v and w: 

dc^ e 

— = BfaWfa^q*. (5.5) 



The matrix B^^^^ is piecewise constant in time, so the solution to (5.5) is a (continuous) 
piecewise exponential function. 

Step 4. Let Q G W N denote the vector of probabilities Q K (t) that the graph is k at time t 
given the current communities of all nodes: i.e., Q K (t) = Pr(K t = The transition rate 



23 



matrix for this joint process on all edges is the Kronecker sum of the transition rate matrices 
for each edge: 

where B 4> = B Hei )<j>(e 2 )- (5-6) 

ee[n]{2> 



dt 



The components Bj, k 'k °f &d> ma Y be expressed as 



&0( ei )0(e 2 ), K (e) K (e), H k' = K, 

e[n]W 



(e*)</.(e*),K'(e*)K(e*), 



if «'(e) = «(e) for all e^e* 
otherwise. 



(5.7) 



Step 5. Let 7£ G R m " r denote the vector of probabilities H^k) that the community 
assignment is and the graph is k at time t: i.e., TZ^ iK ^(t) = Pr($ 4 = 0, K 4 = k). The 
transition rate matrix for this process is not quite a Kronecker sum due to the dependence 
of Btf) on — it loses the various nice properties that Kronecker sums have, but the formula 
is quite similar: 



~dt 



CTZ, where C^> iK i^ K ) — A^^5 K t K + b^^B^^i K . 



(5. 



A Bayesian filter has a prediction step (which applies while the graph data remains constant) 
and an update step (which applies when the graph data changes). Therefore, we need to 



decompose (5.8) into a component which is zero while the graph is constant and a component 



which is zero when the graph changes. The required decomposition uses slightly modified 
matrices A' K and B'^. 

= K^'^k'k + VA^*' where ( 5 - 9 ) 

■^k,<i>'4> = A$$ + $4>'4>Bfa KR , and B ( f >K , K = B^ K < K — 5 K i K B^ Kn . (5.10) 
The components A' K of A' K may be expressed as 



A 1 



/] a<j>(v)4>(v) + / ] ^(ex)0(e2),«(e)«(e)) 
ee[n]{ 2 > 



u=l 



i>'{v*)(j>{v*)i 



0. 



if0' = 0, 

if 4>'(v) = 0(f) for all v ^ v * 
otherwise. 



(5.11) 



The components B', K , K of B', may be expressed as 



n' 



^(eI)^(e|),K'(e*)«(e*), if ^'( e ) = K { e ) for a11 e 7^ e * 

otherwise. 



(5.12) 



Step 6. The prediction and update steps of the Bayesian filter are now determined by the 
matrices A' and B ' . For the prediction step, suppose that the graph data through time t 
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is K[o,t ] an d let k = K to be a concise notation for the graph at time to- From a previous 
step of the filter (or from an initialization) we are given the distribution on the community 
assignments Pr($ to |K[ ,i ] = ft[o,t ])- Starting with this distribution on at time t , let 
1Z K {t) G (for all t > t ) be a vector whose component is the probability that (a) the 
graph remains k during the time interval [to,t), and (b) the community assignment is at 
time t. The initial value of 1Z K is then H K (to) = Pr($ to |Kr 0) t ] = «[o,t ])- Its evolution law is 
given by 

Note that A! K is not a transition rate matrix: it allows probability to leak out of the vector 
TZ K (t) so that its sum does not remain 1, but rather equals the probability Pr(K[ tQ)t ) = 
k|K[ 0A) ] = «[o,t ])- Normalizing TZ K (t), however, gives us the probability distribution of 
given that the graph has remained k during the time interval [to,t): 

Pr($ t = (f>\K m = K [0>t) ) oc (7^(t))^. (5.14) 

This, then, is the prediction step of the Bayesian filter. The update step is obtained from 
£>': given that the community assignment is 0, the probability of a single edge e = {v,w} 
transitioning from type k = K t -(e) to type k! = Kt(e) in some infinitesimal time period At is 



given by (5.12) as &0(u)0(u>),fe'feAt. The conditional probabilities of the posterior distribution 



on given this single edge transition are proportional to this. Therefore 

Pr($ i = (f)\K m = K [0t t}) oc b 4> ( v w w ) tk , k Pr($ t - = 0|K [Oit ) = «[o,t))- (5.15) 

This equation holds only for a single edge transitioning. When multiple edges transition at 
exactly the same time, the correct update procedure would be to average, over all possible 



orderings of the edge transitions, the result of applying (5.15) successively to each transition. 

Figure [7] shows the exact evolution for the very simple, dynamic planted partition case 
7i(3, 2, 1, 3, 1, 1, 3). Here there are n = 3 nodes and m = 2 communities, so only four possible 
partitions of the nodes (all in one community, or one of the three nodes by itself). Each 
partition corresponds to two community assignments (of equal probability), and we plot 
the probability Pr($ t = 0|K[ O ,t] = K{o,t]) for assignments from each of these four partitions. 
The graph data is shown in top row of the figure: the graph is initially empty, then an edge 
turns on, then another, and then an edge turns off. We observe that while the graph is some 
constant k the probabilities decay exponentially toward a steady state vector (the null vector 
of A' K ). When the graph changes, the probability of each community assignment hypothesis 
jumps, then begins decaying toward a new steady state. The bottom row of Figure [7] shows 
the ground truth time-history t of community assignments which were used to generate the 
graph data, and the yellow highlighting in the figure indicates which community assignment 
hypothesis is the true one. Further details may be found in [24J. 

5.2 Marginalization 



Using notation similar to that in Section 3.1, let Pi(t) = Pr($ t (f) = i|K[o,t] = K [o,t]) be the 



probability that node v is in community i at time t, Pij"(t) = Pr($ 4 (t>) = i, <3> f (w7 ) = j|K[ 0) t] = 
K\o,t]), and so on. Note that the same notation (t) was used in Step 1 of the derivation in 
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the previous section to denote a prior probability, but henceforth it will indicate a quantity 
conditioned on the graph data K[0, t]. To indicate conditioning on K[0,t) we will use the 
notation t~: e.g., Pi(t~) = Pr($i(t>) = i |K[ ,t) = ^[o,*))- The prediction and update equations 
for the full probability distribution are linear (up to a constant factor), so we can sum them 
over the groups of all nodes aside from v to obtain an expression for Pi{t). When we apply 
this marginalization to (5.13), we get a quantity pl{t) proportional to p^it) (note that this 
use of the notation p\ differs from that of Section 3.1). We let k = K t be a concise notation 
for the graph at time t. The marginalization of (5.13) is then 

m m 

Oij,K({v,w})n({v,w})Pij 



^2 ^^ b jkA{™^)K({™,x})P V ijk X - ( 5 - 16 ) 



1=1 



w£[n] j = l 



{w,x}C[n] j=l k=l 



We may convert this to an equation for Pi(t) itself (albeit a nonlinear one) by expressing p\ 
as Pi(t) = pi(t)/P K (t). The sum of Pi(t) from i — 1 to m equals 1 for every node v, so the 
sum of Pi(t) from i = 1 to m is P K {t) for every v. Because Pi(t) = p^{t)/P K (t) — p^(t)S K (t), 
where S K (t) = P K (t)/P K (t), we have 



Pi 



P1S K + 



({w,x})K,({w,x})Pijk 



(5.17) 



wE[n] j=X 



{w,x}C[n] j = l k=l 



Here S K may be expressed as 



{v,w}C[n] i=l j=l 



(5.11 



We now marginalize the update equation (5.15) when the edge e transitions from type 
I = n t -(e) to type I' = K t (e). To get the updated probability pl{t) that node v is in 
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Figure 8: First-order marginals for community membership 



community i after the transition there are two cases: v 6 e (say, e = {v, w}) and v £ e (say 
e = {w, x}): 



m m 



(5.19) 



EE 6 A'<r) if e = {«;,*}. 

^ 5=1 fe=l 

The normalization constant Si>i(t~) is independent of the node i>: 



i=l j=l 



(5.20) 



Whereas the Bayesian filter (5.14) and (5.15) specifies the exact evolution of probabilities 
in an unmanageably large state space (m n elements), the marginalized filter (5.17) and (5.19) 
involves only mn elements. It is still an exact filter — no approximations have been made — 
but it is not useful as it stands because it is not closed. The equations for the first-order 
statistics p\ involve second- and third-order statistics p™ and p™ k x . One could write down 
equations for these, but they would involve still higher-order statistics, and so on. Instead, a 
closure model is needed for the second- and third-order statistics in terms the p\. The topic 
of closures is discussed in Section [6] 



To verify (5.17) and (5.19), one can run the full filter (5.14) and (5.15) and use it to 



compute the required second- and third-order statistics. The results of evolving p\{t) using 
these oracular terms should then agree with those obtained by marginalizing the full solution 
direction. Figure [8] shows the results of such a verification. A simulation of l-t(n, A, B) with 
n = 12 nodes, m = 3 communities, and r = 4 edge types was used to generate graph data 
«[o,o.8l an d ground-truth community assignment data 0[o,o.8l- The transition rate matrices A 
and used are given in [21]. The final frame (0o.8> K o.s) of this run is shown in Figure [8j 
the centers of each disk correspond to the communities (green, yellow, or purple) which 
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6 .8 assigns to each node; the colors of each edge (white, black, blue, or red) are given by 



k 08 . The same parameters A and B were then used in the Bayesian filter (5.14) and (5.15), 
which required evolving a system of 3 12 ~ 530, 000 quantities. The marginalized probabilities 
p% (0.8) are depicted in the outer bands, so accuracy is indicated by the outer band largely 
agreeing with the center. 

In the case of the dynamic planted partition model T-L(n,m,a, A/, pi, Ao, Po), the first- 
order statistics are trivial: the prediction and update equations reduce to the observation that 
the probability a node v is in some community equals 1. Instead, with some bookkeeping, 
one can derive a Bayesian filter for the second-order statistics p™(t) and reduce these to 
a filter for the co-membership probabilities p^ v ' w ^(t): this is similar to what was done in 



Section 3^ although that was for an approximation based on limited graph evidence, and 
this is exact. The filter for p^ v ' w ^ depends on third- and fourth-order statistics. There are 
5 third-order statistics, which sum to 1, and we denote them pWWt^ p{v}{w,x}^ p{w}{v,x} ^ 



{x}{v,w} 



and p 



{v,w,x} 



These correspond to the probabilities that v, w, and x are in different 



p 

communities, that two are in the same community with v, w, and x, respectively, being in 
another, and that all three are in the same community. Similarly there are 15 fourth-order 
statistics. The two that matter here are p\ v >' w > x >yy and pi v < w }{ x ,y} _ The sum Q f these two is 
the probability that v is in the same community as w and that x is in the same community 
as y. 

The prediction step of the Bayesian filter for p^ v,w ^(t) may be expressed as 



R 



P 



{v,w} 



2am 



m 



rn 



V 



{v,w} 



vw 



vw 



vw vw 
I Hvw 



x£[n] 



{x,y}C[n] 
x , y 7^ v , w 



The form of the update step for p^ v ' w ^ depends on whether the edge e that is flipping at time 
t has 2, 1, or nodes in common with {v,w}: 



P 



{v,w} 



(t) = p^}(t-) + 



rv WX r WX (■!■- 

h~ vw \ b 

i7-siz(t- 



if e = {v, w}, 
if e = {w, x}, 
if e = {x,y}. 



(5.22) 



The notation used in (5.21) and (5.22) is defined as follows. We define ^ft ^° ^ e t ne 



transition rate for an edge to flip (i.e., turn on or off) between nodes v and w at time t 
under the hypothesis that they are in the same community. This transition rate depends on 
whether there is currently an edge between v and w. Therefore, 7^, and its counterpart 
7q™ for the hypothesis that v and w are in different communities, are given by 



vw 

li,t 



Pi if {v,w} G E(G t ), ^ 

A, if {v,w}eE(G t ), 7o '* 



Ho if {v, w} E E(G t ), 
Ao if {v, w} e E(G t ). 



(5.23) 



The quantity j% w , which plays an important role in (5.21) and (5.22), is the difference 
between the flip rates under the two hypotheses: 



It 



vw 

ll,t 



vw 

lo,t- 



(5.24) 



28 




Figure 9: Histograms of third- and fourth-order statistics 



On the other hand, the normalization constant 5 vw (t) in (5.22) is the expected flip probability 



given our current knowledge of the probabilities of the two hypotheses: 

S vw (t) = -f v T w t p {v ' w} {t) + 7o> WW (*), (5.25) 

where p™" 1 ' = 1— p{ y ' w } is the probability of v and w being in different communities. Finally, 
the q, r, and s quantities represent modified second-, third-, and fourth-order statistics, 
respectively: 

qZ=P {vM P {v}{w} , (5.26) 
r vx ± p {v,w,x} _ p {v,w} p {vM j and (5.27) 

s*v = pi",™™} + p {vM{x,v} _ p {vMp{x,y}. ( 5 . 2 8) 

The r and s quantities measure deviations from independence. That is, if the event $i(t>) = 
&t(w) (i.e., v and w are in the same group at time t) were independent of $f(f) = &t{x), 
then the probability of both events occurring (i.e., $<(t>) = $t(u?) = $t(x)) would equal 
the product of their probabilities: that is, pv ) ' w ' x ^ = pi v < w }p{ v , x } ? or r ^ — o. Similarly, 
if v and w being in the same community were independent of x and y being in the same 
community (which seems more plausible), then we would have = 0. The terms R vw 
and S vw represent the accumulated effects on p^ v ' w ^ of the third- and fourth-order deviations 
from independence, respectively. Understanding the role of these quantities is important for 
developing an effective closure for this marginalized filter. 



6 Community Tracking: Approximation 

To develop closures for the and terms in ( |5.21[ ) and ( |5.22[ ) it helpful to know how 
they behave statistically. In this section we present a preliminary investigation of these 
statistics and a possible closure for them using, as an example, the 7/(12,3,0.5, 16,4,2, 18) 
case depicted in Figure 161 For this case we compute histograms for the R vw and S vw terms 



in (5.21). In each case, we compile separate histograms for the case of v and w being in 
the same ground-truth community (case J) and in different communities (case O). The 
histograms for the I and O cases of R vw are shown in Figure |9j In the I case, we tend to 



have R vw < 0, which makes a positive contribution to p{ v > w } in (5.21); whereas in the O 
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_ p{v,w,x} 


= 1 


p{v}{w,x} _ 


_ p{v,w,x) 


= P 


p{w}{v,x} _ 


_ p{v,w,x) 


= P 


p{x}{v,w} _ 


_ p{v,V),x) 


= P 



case the opposite occurs. Thus these third-order statistics make an important contribution 
to the evolution of p^ v ' w \ On the other hand, S vw appears to be much less important. In 
the / case, also shown in Figure |9j S vw is tightly and symmetrically clustered near 0. The 
O case is similar. This makes sense intuitively: as mentioned at the end of Section 5.2 it 
seems more plausible for v and w being in the same community to be independent of x and 
y being in the same community than for v and w to be independent of v and x. Therefore, 
we will make this independence assumption to obtain the fourth-order closure s x l^ = 0. It 
remains to develop a closure for r^. 

The five third-order statistics must sum to one and be consistent with the second-order 
statistics. This is expressed by the following four equations: 

p{v}{w}{x} _|_ p{v}{w,x} _|_ p{w}{v,x} _|_ p{x}{v,w} _j_ p{v,w,x} _ ^Q J 

{w ' x} , (6.2) 

{v ' x} , and (6.3) 
{«M ( 6 .4) 

This leaves one degree of freedom, which we choose pi v > v '> x } to represent. The constraint that 
the variables in (6.1)-(6.4) are non-negative imposes the following bounds on p\ v ' w > x }; 

p~ = - (p {w ' x} + p {v ' x} + p {v ' w} ) , and (6.5) 
p + = min (p {w > x} ,p {v > x} ,p {v > w} ). (6.6) 

For consistency, a closure for p{ v < w > x } should satisfy p~ < p{ v < w < x } < p + . We will consider 
closure models that select p{ y > w > x } e [p~ ,p + ] as a function of p^ w ' x \ pi v ' x } ) and p{ v ' w \ Many 
natural approximations (such as a symmetric version of p\"' w ^ pKu>}pKa;}) fail to satisfy 
p- < p{v,w,x} < fpjjg approximations p{ v < w < x } p + and p{ v > w > x } ps p~ = max(p~,0) have 
poor properties. A least-squares solution is possible, but a better principle to employ is 
maximum entropy [38J. 

When applying the maximum entropy principle, one needs a suitable underlying measure 
space. In this discrete case, this simply means a set of atomic events which are equally likely 
a priori. Such events arise naturally in this case: there are m 3 of them with probabilities 
equal to p™^ ■ I n the absence of graph data, symmetry implies that their probabilities are 
each m~ 3 . Thus the entropy is defined as 

m m in 

i=i j=i k=i 

In this symmetric case, the m 3 values of p™ k x take only five distinct values, so we may 
re-write (6.7) as 

{n}{ii>}{x-} {v}{w,x} {w}{v,x} 

H = - pWMW i og ^_ p {v}{ w ,x} log p_ p {w}{v,x} log p_ — + 

i m )3 {m) 2 (m) 2 

n {x}{v,w} n {v,ui,x} ^ ' ' 

- p^H",*} bg p {«,«^} ^g £ 

(m)2 m 
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Figure 10: Third-order statistics: exact vs. closure 



where (m) r = m\/(m — r)\ denotes the falling factorial. We may use (6.1)-(6.4) to express 
the other variables in terms of p{ v < w < x \ then take the derivative of H with respect to pi v ' w ' x \ 
This reduces to 

/ ( m _ 2) 2 (pi w < x } — p{v,w,x}^ ^p{v,x} _ p{v,w,x}^ ^p{v,w} _ p{v,w,x}^ 



dp{v,™,x} 6 yi(m - 1) p {v,w,x} ( p {v,w,x} _ p-j 



Provided p^ < p + , the function in (6.9) is strictly decreasing from oo to — oo on \p$ , p + ], so 
it has a unique zero within this interval, and this zero is where H attains its maximal value 
on [pq ,p + }. Finding this zero involves solving a cubic equation, which yields the maximum 
entropy closure for r\ 



„vx 

VW 



Figure 10 indicates how well this closure performed in the K(12, 3, 0.5, 16, 4, 2, 18) exam- 
ple. The histograms show good agreement: each has values that are predominantly positive, 
with a peak in the same place, and distributions of similar shapes, although the closure 
distribution is a little more spread out. This suggests that using this closure for along 
with the fourth-order closure is a promising idea. Unfortunately, this closure has nothing 
built into it that ensures p^ v > w ^ stays in the range [0, 1], and, indeed numerical simulations 
that use it quickly produce probabilities outside this range. 

To see why it fails, consider the consistency requirement p" < p + mentioned above. This 
may be re-written 

|pMM _pMM| < pM-M < p[v}{x} p{w}{x} ^ (6.10) 

i.e., the probabilities satisfy the triangle inequality. In particular, if p^ v ' w ^ = 1, then 

p{v,x} _ p{w,x} f Qr a vj noc [ es x ^ v,w. This makes sense: if v and w are definitely in the same 
community C, then both pi v > x ^ and p^ w ' x ^ mean "the probability that x G C," so it would 
be illogical for these values to differ. Furthermore, in the special case a = 0, if p^' tt > = 1 at 
some time, then it will remain 1, which implies p^ v ^ = p{ w > x }. The direct verification of this 



fact using (5.21) leads to an expression in which S vw = —R vw ^ 0. In this case, the fourth- 
order terms play a crucial role in maintaining consistency, so it is not surprising truncating 
them entirely leads to inconsistencies. This is but one of the issues that must be addressed 
before principled community tracking algorithms along these lines can be developed, but we 
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believe there is much promise in this approach of Bayesian filtering using formal evolution 
and measurement models 



7 Conclusion 

Network science has benefited from the perspectives and expertise of a variety of scientific 
communities. The data fusion approach has much to offer as well. It provides an integrated 
framework for synthesizing high-level situational awareness from messy, real-world data. It 
also develops tracking algorithms based on formal evolution models that maintain represen- 
tations of the uncertainty of the ground-truth state. We have applied this perspective to the 
community detection problem in network science. We began with a derivation of the poste- 
rior probability distribution of community structure given some graph data: this is similar 
to the approach of [531 El]- However, rather than seeking the community structure that 
maximizes this posterior probability, we developed approximations to the marginals of these 
distributions. In particular, we consider the pairwise co-membership probability p^ v ' w J-. the 
probability that nodes v and w are in the same community. We develop an estimate of p^ v ' w s 
based on using limited information from the graph G and approximating an integral over 
the model's structural parameters. The resulting method is very fast, and, when exploited 
to produce a single community detection result (via the utility formulation in Appendix [A]) 
yields state-of-the-art accuracy. Various uses for these p{ v > w } quantities are combined in the 
network analysis and visualization product IGNITE. 

We extended our community detection approach to tracking the evolution of time- varying 
communities in time-varying graph data. We proposed dynamic analogs of the stochastic 
blockmodel and planted partition models used in static community detection: these models 
are continuous-time Markov processes for the joint evolution of the communities and the 
graph. We derived a Bayesian filter for the current probability distribution over all com- 
munity structures given the previous history of the graph data. This filter decomposes into 
prediction steps (during periods of constant graph data) and update steps (at times when the 
graph changes). The filter is over too large a state space to use directly, so we marginalized 
it to get state spaces of a reasonable size. These marginalized equations require closures for 
their higher-order terms, and we discussed one possible closure based on maximum entropy. 

The community detection work could be extended to more realistic graph models, such 
as the degree-corrected blockmodel of [H] , and the integral approximation developed in Sec- 



tion 3.2 could certainly be improved. There is much more work to do in the community 
tracking case, however. In the spectrum of methods that handle dynamic network data, 
the models presented are intermediate between those that use the data only to discern a 
static community structure (e.g., [67] ) and those mentioned in Section [5] that allow not just 
individual node movements, but the birth, death, splitting, and merging of communities. 
Extending this methodology to account for these phenomena would be important for appli- 
cations. Whatever model is used, some method for parameter selection must be developed. 
Integrating over the parameter space may be too complicated in the community tracking 
case, but one may be able to extend the Bayesian filter by making the input parameters 
themselves hidden variables, and use Hidden Markov Model (HMM) techniques for param- 
eter learning [4J. Closures must be developed, preferably with proofs of their properties, 
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rather than just experimental justification. Finally, to be truly useful, community tracking 
need to be able to incorporate ancillary, non-network information about the properties of the 
nodes and edges involved. Bayesian data fusion provides an excellent framework for coping 
with this kind of practical problem by identifying what expert knowledge is needed to extend 
the uncertainty management in a principled manner. 
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A Expected Utility Formulation 

Figure [TT] depicts the data fusion approach this paper takes toward community detection 
and tracking. Part (1) represents the model of the hidden state space in which we are 
interested (in this case the community assignments <fi or community partitions 7r), including 
some prior probability distribution over it, and, in the dynamic case, an evolution model. 
Part (2) represents observed data, and the red arrow from (1) to (2) is the measurement 
model which assigns conditional probabilities of the data given the true state. Part (3) is 
Bayesian inversion, which yields the conditional probabilities of the true state given the data. 
When the number of possible states is large, the combinatorial explosion of part (3) must 
be dealt with in some manner. Thus, part (4) is some reduced representation of the full 
posterior distribution in (3): in this case the pairwise co-membership probabilities p^ v > w \ 
Finally, part (5) represents the end user of this inference process. The full posterior could, 
in principle, answer every question the end user might have about the true state, given the 
limitation of the data available. The reduced representation (4) should be chosen in such a 
way that it may be gleaned efficiently from (3), but also meet the needs of the end user (5). 

The body of this paper addresses parts (l)-(4) for the community detection and tracking 
problems. This appendix connects that analysis to part (5). Bayesian decision theory [5] 
provides a way to formalize the connection. Consider the problem of how to assess the quality 



37 



of some proposed (or computed) partition tt. The usual approach is to stipulate some quality 
function of tt and the graph G. Bayesian decision theory argues for assessing the quality 
of tt in terms of its agreement with the true partition tt. Thus, with the cooperation of 
the end user in (5), we should construct a utility function U(tt\tt) which defines the value 
of being provided the answer tt when the true partition is tt. Because the ground truth 
partition tt is unknown, we cannot compute U(tt\tt). However, if we know the posterior 
distribution Pr(7r|G), we can compute the expected utility E[U](ir\G). In general, summing 
over all possible partitions tt to get this expected utility is not practical, but we shall provide 
a generic example in which the summation reduces to one over all pairs of nodes, with the set 
of pairwise co-membership probabilities p{ v,w } providing all necessary statistical information. 
In this sense, the values of p^ v ' w ^ comprise the reduced representation in part (4) of Figure 11 
Let V = [n] denote the set of n nodes, and H{V) denote the set of all partitions of V. For 
any particular tt G n(V), let U(tt\tt) denote the utility of the decision that tt is the correct 
partition when tt is in fact correct. For any graph G on V, one may define the quality of tt 
for G to be the expected value of its utility: 



E[U](n\G) 



- ^k)Pr(7r|G), 
7ren(v) 



(A.l) 



where Pr(7r|G) is the posterior probability of tt given G. To simplify (A.l), assume U(tt\tt) 
is additive: 



i.e., that it can be expressed as 

U(tt\tt) = 



5>(tt[C]) 



(A.2) 



for some function u, where tt[C] is the partition of C induced by tt (that is, the sets into 
which tt "chops up" C). In this case, one can compute the expected utility of each community 
C E tt given G, 

E[U](C\G) = J2 «(7r[C])Pr(7r|G), (A.3) 
7ren(v) 



and sum them over C G tt afterward to get E[[/](7r|Cr). Grouping the sum in (A.3) into 
classes based on the value of tt[C] yields 



E[U}(C\G)=J2 u ^J2 Ft W G ^ 

pen(C) n:n[C]=p 



(A.4) 




Figure 11: Five-part inference framework 
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The inner sum in (A. 4) is the probability of p given G, denoted Pr(p|G). By grouping the sum 



in (A. 3) less aggressively, however, one obtains a more general result which includes (A. 3) 



and (A. 4) as special cases. For any C + D C, 



E[U](C\G) = u(p + [C})Pr(p + \G). (A.5) 
p+en(c+) 

In particular, we let C + — C U {v } and define 

6(6, v) = E[U](C+\G) - E[U](C\G) - E[U]({v}\G) (A.6) 

to be the increase in expected utility of merging the community C with the singleton com- 
munity {v}. Using ( |A.5[ ), 6(C, v) may be expressed as 

S (C,v)= (u(p+)-u(p + [C})-u({{v}}))Pr(p+\G). (A.7) 

p+en(c*+) 

This equation allows the expected utility of a community to be computed by adding one 
node at a time. 

A.l A generic utility model 

Here we consider a simple, generic utility function in which some node v has been identified 
as "bad," implying that the nodes in his ground-truth community C = tt(v ) are bad as well, 
but that decision makers are using some other partition fr, resulting in all nodes in C = ft(v) 
ending up "dead." Let ubd and uba be the respective utilities of a bad node being dead 
and being alive, and uqd and uqa be the corresponding utilities for good nodes. Summing 
the utilities over the cases v £ V being identified as bad yields the utility function 

f7*(7r|7r) ={u BD - uba - u GD + u GA ) \C H C\ 2 + 

(U BA - U GA ) ^ \ C \ 2 + ( U GD - UGA) ^ l^l 2 + U GAn 2 

after some rearrangement. Assuming u G a > ugd and ubd > uba (and equality does not 
hold for both), a convenient choice of translation and scaling yields 

u(*\*) = U \cnc\*-eJ2\c\*-(i-e)n\ (A.9) 

where the threshold 6 = (uga — ugd) / (uga — ugd + ubd — uba) € [0, 1] reflects the emphasis 
of killing bad nodes (small 9, yielding large communities) versus not killing good nodes (large 
9, small communities). This utility function is additive: it may be decomposed according 



to ( |A.2D with 

<p) ^(E^l 2 -^! 2 -^-^!)' ( a - 10 ) 
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for any p G H(C). Note that u(p) = for singleton partitions p = {{v}}. 
Substituting into ( |A.10p into §KJ\ yields 

= (\p + (v)\-l-6\C\)Pr(p + \G). (A.ll) 

p+en(c+) 

This may be written 5(C,v) = E[p + (t>) — 1\G] — 0\C\. The first term here is the expected 
number of nodes in C that are in the same community as v in a random partition of C + , 
given the evidence G. Linearity of expectation implies 

Elp+(v)-l\G] = J2/ {vM > (A-12) 

where p^ v ' w ^ = Pr({{v, w}}\G) is the probability that v and w lie in the same community, 
given G. This may be used to build up E[[/](C , |G) by adding nodes v to C one at a time, 
which yields the simple formula 

E[U){C\G) = (P iv,w} ~ 9 )- ( A - 13 ) 

{v,w}CC 

Summing this over C yields an expression for the expected utility of a partition n given the 
graph G which involves only sums over pairs of nodes in the same community C G tt: 

E[U](n\G)=J2 E {P {vM -0)- (A-14) 

CGt^v,w}CC 
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